Learning Paraphrase Models from Google New Headlines

نویسنده

  • Dave Kale
چکیده

Data sources like the clusters of news headlines at Google News present an exciting opportunity to learn paraphrase models from data automatically. We present both a novel dataset and a novel approach to automatic, unsupervised learning of paraphrase models from that datset. Leveraging existing NLP tools such as the Stanford Parser and lexical resources such as WordNet and Infomap, we constructed a system that first aligns the typed dependency graphs of large numbers of parallel headlines (on the order of hundreds) and then uses aligned paths between corresponding nodes as candidates to a paraphrase extraction system. We present some preliminary results in the form of actual learned paraphrase models. This project serves as a proof of concept for this approach and sheds some light on likely next steps.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Paraphrasing Headlines by Machine Translation

In this paper we investigate the automatic collection, generation and evaluation of sentential paraphrases. Valuable sources of paraphrases are news article headlines; they tend to describe the same event in various different ways, and can easily be obtained from the web. We describe a method for generating paraphrases by using a large aligned monolingual corpus of news headlines acquired autom...

متن کامل

Paraphrase Generation as Monolingual Translation: Data and Evaluation

In this paper we investigate the automatic generation and evaluation of sentential paraphrases. We describe a method for generating sentential paraphrases by using a large aligned monolingual corpus of news headlines acquired automatically from Google News and a standard Phrase-Based Machine Translation (PBMT) framework. The output of this system is compared to a word substitution baseline. Hum...

متن کامل

A contrastive review of paraphrase acquisition techniques

This paper addresses the issue of what approach should be used for building a corpus of sentential paraphrases depending on one’s requirements. Six strategies are studied: (1) multiple translations into a single language from another language; (2) multiple translations into a single language from different other languages; (3) multiple descriptions of short videos; (4) multiple subtitles for th...

متن کامل

Sentence Paraphrase Graphs: Classification Based on Predictive Models or Annotators' Decisions?

As part of our project ParaPhraser on the identification and classification of Russian paraphrase, we have collected a corpus of more than 8000 sentence pairs annotated as precise, loose or non-paraphrases. The corpus is annotated via crowdsourcing by naïve native Russian speakers, but from the point of view of the expert, our complex paraphrase detection model can be more successful at predict...

متن کامل

Clustering and Matching Headlines for Automatic Paraphrase Acquisition

For developing a data-driven text rewriting algorithm for paraphrasing, it is essential to have a monolingual corpus of aligned paraphrased sentences. News article headlines are a rich source of paraphrases; they tend to describe the same event in various different ways, and can easily be obtained from the web. We compare two methods of aligning headlines to construct such an aligned corpus of ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007